This dataset describes the monthly number of shampoo sales over a three-year period.
The units are sales count and there are 36 observations.
Dataset source: Makridakis, Wheelwright and Hyndman (1998).
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly import offline
from plotly.subplots import make_subplots
offline.init_notebook_mode()
First, let's explore the dataset!
df = pd.read_csv('../datasets/shampoo.csv')
df.shape
(36, 2)
df.dtypes
Month     object
Sales    float64
dtype: object
df.head(2)
| | Month | Sales |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
df.tail(2)
| | Month | Sales |
|---|---|---|
| 34 | 3-11 | 581.3 |
| 35 | 3-12 | 646.9 |
We should also check for NaN values.
df.isna().sum()
Month    0
Sales    0
dtype: int64
This is a very small and simple dataset, so there is not much data cleaning to do. We can, however, do some preprocessing and transformations.
Since we are working with a time series, it is easier to represent Month as a datetime type. The values have the form year-month, with the year written as a single digit (1 to 3). To convert them, we first pad the year with a 0 so that it matches the two-digit %y directive used by pandas' to_datetime function. For this particular dataset we do not care about the actual year of the observations, but note that to_datetime interprets low two-digit years without a century (such as 01) as 20xx years.
df['Month'] = df['Month'].apply(lambda x: x.zfill(5))
Now the dates are zero-padded.
df.head(2)
| | Month | Sales | diff |
|---|---|---|---|
| 0 | 01-01 | 266.0 | NaN |
| 1 | 01-02 | 145.9 | -120.1 |
df['Month'] = pd.to_datetime(df['Month'], format='%y-%m')
With the conversion they will look like this:
df.head(2)
| | Month | Sales | diff |
|---|---|---|---|
| 0 | 2001-01-01 | 266.0 | NaN |
| 1 | 2001-02-01 | 145.9 | -120.1 |
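To see why the padding is needed and how two-digit years are resolved, here is a minimal sketch that uses the standard library's datetime.strptime as a stand-in. It assumes pandas applies the same %y rule when a format string is given, which matches the 2001 dates shown above; the sample strings other than "1-01" are made up for illustration.

```python
from datetime import datetime

raw = "1-01"            # first value of the Month column
padded = raw.zfill(5)   # left-pad with zeros to 5 characters: "01-01"

# %y expects exactly two digits, so the unpadded string is rejected.
try:
    datetime.strptime(raw, "%y-%m")
    unpadded_ok = True
except ValueError:
    unpadded_ok = False

parsed = datetime.strptime(padded, "%y-%m")
print(unpadded_ok)  # False
print(parsed)       # 2001-01-01 00:00:00

# Two-digit years pivot at 69: 00-68 become 20xx, 69-99 become 19xx.
year_68 = datetime.strptime("68-01", "%y-%m").year
year_69 = datetime.strptime("69-01", "%y-%m").year
print(year_68, year_69)  # 2068 1969
```

The day defaults to the first of the month because the format contains no day directive.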
Plotly lets us see how sales evolve over the months. To do that, let's create a function that draws line plots, so our code is reusable.
def plot_line(df, x, y, title, template='simple_white'):
    # Draw a single line plot of column y against column x.
    fig = px.line(data_frame=df, x=x, y=y,
                  title=title,
                  template=template
                  )
    fig.show()
plot_line(df, 'Month', 'Sales', 'Sales of Shampoo by month')
Looking at this plot, sales appear to be increasing overall, with some peaks and valleys along the way. There are tools that help us understand how this is happening and at what pace. A good way to do that is to decompose the time series into its components, or to look at the difference between consecutive months to see how much sales increase or decrease month by month.
Differencing is a time series transformation that stabilizes the mean of a series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality (Hyndman, 2018). It is one way to make a non-stationary time series stationary. Pandas already has a diff function we can use to create another column in our dataset.
df['diff'] = df['Sales'].diff()
plot_line(df, 'Month', 'diff', 'Sales difference by month')
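As a rough illustration of how differencing removes a trend, here is a minimal pure-Python sketch of a first difference. The helper first_difference and the toy series are hypothetical and only mimic what diff computes; unlike pandas, the helper drops the first position instead of filling it with NaN.

```python
def first_difference(values):
    """Return the change between each observation and the one before it."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Hypothetical series with a linear trend: it rises by 3 at every step.
trend = [10, 13, 16, 19, 22]

diffed = first_difference(trend)
print(diffed)  # [3, 3, 3, 3]
```

The original series has a rising level, while the differenced series is constant, so its mean no longer depends on time.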
If temporal structure remains after differencing, we can perform the diff operation again. The idea is to repeat this process until all temporal dependencies are removed. The number of times the differencing operation is performed is called the difference order.
df['second_order_diff'] = df['Sales'].diff().diff()
plot_line(df, 'Month', 'second_order_diff', 'Second-order sales difference by month')
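A small sketch of why a second pass can be needed, assuming a toy series with a curved (quadratic) trend. The difference helper and its order parameter are hypothetical names used only for this example.

```python
def difference(values, order=1):
    """Apply a lag-1 difference `order` times in a row."""
    for _ in range(order):
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values

# Hypothetical series with a quadratic trend: the squares 1, 4, 9, ...
curved = [1, 4, 9, 16, 25, 36]

first = difference(curved, order=1)
second = difference(curved, order=2)
print(first)   # [3, 5, 7, 9, 11]
print(second)  # [2, 2, 2, 2]
```

After one pass the series still trends upward; after two passes it is flat, so a difference order of 2 is enough for this toy series.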
Taking the difference between consecutive observations is called a lag-1 difference. The lag difference can be adjusted to suit the specific temporal structure. For time series with a seasonal component, the lag may be expected to be the period (width) of the seasonality.
Let's rename our columns with shorter labels. Note that the column we call lag-2 holds the second-order difference (a lag-1 difference applied twice), which is not the same as a difference taken at lag 2, df['Sales'].diff(periods=2).
df.rename(columns={'diff': 'lag-1', 'second_order_diff': 'lag-2'}, inplace=True)
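The lag and the order of a difference are separate settings, which a toy example can make concrete. The sketch below is pure Python with a hypothetical lag_difference helper and made-up series; in pandas, the lag corresponds to the periods argument of diff.

```python
def lag_difference(values, lag=1):
    """Subtract from each observation the one `lag` steps before it."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

series = [1, 4, 9, 16, 25, 36]

# Second-order difference: a lag-1 difference applied twice.
second_order = lag_difference(lag_difference(series, lag=1), lag=1)
# Lag-2 difference: a single pass comparing observations two steps apart.
lag_two = lag_difference(series, lag=2)
print(second_order)  # [2, 2, 2, 2]
print(lag_two)       # [8, 12, 16, 20]

# Hypothetical seasonal series: a pattern of period 4 that shifts up by 2 each cycle.
seasonal = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
deseasonalized = lag_difference(seasonal, lag=4)
print(deseasonalized)  # [2, 2, 2, 2, 2, 2, 2, 2]
```

Setting the lag to the seasonal period cancels the repeating pattern in a single pass, leaving only the change from one cycle to the next.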
Putting the three plots together makes them easier to analyse. To do that, we create a function that stacks multiple plots using Plotly. We also create a function named go_line_plot that creates a single Scatter trace, so we can reuse it later if we need simpler plots.
def go_line_plot(x, y, name):
    # Build a single line trace that can be added to any figure.
    return go.Scatter(x=x, y=y,
                      mode='lines',
                      name=name)
def stacked_subplots(rows, dict_values, df, title):
    # One subplot per row, stacked vertically in a single column.
    fig = make_subplots(rows=rows, cols=1)
    i = 1
    for key, value in dict_values.items():
        # value holds the (x, y) column names; key is the trace label.
        fig.add_trace(
            go_line_plot(df[value[0]], df[value[1]], name=key),
            row=i, col=1
        )
        i += 1
    fig.update_layout(title_text=title, template='simple_white')
    fig.show()
We then create a dictionary that maps each plot label to its (x, y) column names.
values = {'sales': ('Month', 'Sales'),
'lag-1': ('Month', 'lag-1'),
'lag-2': ('Month', 'lag-2')
}
stacked_subplots(rows=3, dict_values=values, df=df, title='Shampoo Sale Analysis')